NLP-enhanced Content Filtering Within the POESIA Project
Abstract
This paper introduces POESIA, an open-source Internet filtering system that combines standard filtering methods, such as positive/negative URL lists, with more advanced techniques, such as image processing and NLP-enhanced text filtering. The description here focusses on the components that provide textual content filtering for three European languages (English, Italian and Spanish) and that employ NLP methods to enhance performance. We also address the acquisition of the language data needed to develop these filters, and the evaluation of the system and its components.
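As a rough illustration of how list-based filtering can be combined with a text-based decision, the following minimal Python sketch checks a positive/negative host list first and falls back to a text score only when the host is unknown. It is not POESIA's implementation: the host lists, the term weights, the threshold and the function names are all hypothetical, and the keyword scorer is only a stand-in for a real NLP-enhanced classifier.

```python
from urllib.parse import urlparse

# Hypothetical lists and term weights, for illustration only.
ALLOW_HOSTS = {"example.edu"}
BLOCK_HOSTS = {"blocked.example"}
TERM_WEIGHTS = {"casino": 1.0, "jackpot": 1.0, "homework": -1.0}


def text_score(text: str) -> float:
    """Sum the weights of known terms in the text (stand-in for an NLP classifier)."""
    tokens = text.lower().split()
    return sum(TERM_WEIGHTS.get(tok.strip(".,;:!?"), 0.0) for tok in tokens)


def decide(url: str, text: str, threshold: float = 1.5) -> str:
    """Return 'allow' or 'block': URL lists take priority, text score is the fallback."""
    host = urlparse(url).hostname or ""
    if host in ALLOW_HOSTS:
        return "allow"
    if host in BLOCK_HOSTS:
        return "block"
    return "block" if text_score(text) >= threshold else "allow"
```

In this sketch the lists short-circuit the decision, so the comparatively expensive text analysis runs only for pages whose host appears on neither list.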
Similar resources
Text filtering at POESIA: a new Internet content filtering tool for educational environments
The Internet gives children easy access to pornography and other harmful material. To improve the effectiveness of existing filters, we present POESIA, a project whose objective is to develop and evaluate extensible open-source Internet filtering software for educational environments.
Text Categorization for Internet Content Filtering
Text Filtering is one of the most challenging and useful tasks in the Multilingual Information Access field. In a number of filtering applications, Automated Text Categorization of documents plays a key role. In this paper, we present two such applications (Hermes and POESIA), focused on personalized news delivery and blocking of inappropriate Internet content, respectively. We are specifically...
Intelligent E-Commerce with Guiding Agents based on Personalized Interaction Tools
Project COGITO aims at an agent-based interface for B-to-C applications that is not merely re-active to some user request, but pro-active and capable of engaging in a goal-directed conversation with the user, e.g., by taking the initiative to recommend new products. The approach combines content-based filtering, where user profiles are generated based on content features extracted from document...
Feeding OWL: Extracting and Representing the Content of Pathology Reports
This paper reports on an ongoing project that combines NLP with semantic web technologies to support a content-based storage and retrieval of medical pathology reports. We describe the NLP component of the project (a robust parser) and the background knowledge component (a domain ontology represented in OWL), and how they work together during extraction of domain specific information from natur...
Improved Document Representation for Classification Tasks for the Intelligence Community
Research within a larger, multi-faceted risk assessment project for the Intelligence Community (IC) combines Natural Language Processing (NLP) and Machine Learning techniques to detect potentially malicious shifts in the semantic content of information either accessed or produced by insiders within an organization. Our hypothesis is that the use of fewer, more discriminative linguistic features...